candle-core: add Device::enable_peer_access for cross-CUDA P2P transfers #3525
Open
toddwbucy wants to merge 1 commit into
`Tensor::to_device(&other_cuda_device)` dispatches to
`CudaStorage::transfer_to_device` → `cudarc::CudaStream::clone_dtod` →
`memcpy_peer_async` when the two devices have different contexts. But
`memcpy_peer_async` requires `cuCtxEnablePeerAccess` to have been
called between the two contexts first; without it, the driver rejects
the call with `CUDA_ERROR_INVALID_CONTEXT`. Result: any GPU-direct
cross-`CudaDevice` transfer fails on first use.
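For concreteness, a minimal repro sketch of the failing path (assumes two visible CUDA GPUs; shape and dtype are arbitrary):

```rust
use candle_core::{DType, Device, Result, Tensor};

fn repro() -> Result<()> {
    let dev0 = Device::new_cuda(0)?;
    let dev1 = Device::new_cuda(1)?;
    let t = Tensor::zeros((4, 4), DType::F32, &dev0)?;
    // Cross-device copy: dispatches down to memcpy_peer_async. Without
    // peer access enabled between the two contexts, the driver rejects
    // this with CUDA_ERROR_INVALID_CONTEXT.
    let _t1 = t.to_device(&dev1)?;
    Ok(())
}
```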
Add an explicit opt-in via a new public `Device::enable_peer_access(&other)`
API:
- `CudaDevice::enable_peer_access(&self, other: &Self)` calls
`cuCtxEnablePeerAccess` in both directions (self ←→ other), bound
to the appropriate context for each direction.
- `Device::enable_peer_access(&self, other: &Self)` (CUDA-only)
  delegates to `CudaDevice` when both are CUDA, errors otherwise
  (sketched below).
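A minimal sketch of that delegation, assuming candle's `Device` enum and `bail!` macro (the exact error message is illustrative):

```rust
// In candle-core; a sketch of the new method, not the exact patch.
impl Device {
    pub fn enable_peer_access(&self, rhs: &Self) -> Result<()> {
        match (self, rhs) {
            (Device::Cuda(lhs), Device::Cuda(rhs)) => lhs.enable_peer_access(rhs),
            // Cpu and Metal devices have no peer-access concept.
            _ => crate::bail!("enable_peer_access expects two CUDA devices"),
        }
    }
}
```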
Idempotent in two senses:
- Same-ordinal pairs are a no-op (`Ok(())`).
- Repeat calls between the same context pair are safe; the driver's
  `CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED` is folded into `Ok(())`
  inside the helper (see the sketch below).
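A sketch of that folding, assuming cudarc's `DriverError` is a public tuple struct over the raw `CUresult` (variant naming per cudarc's sys bindings; verify against the cudarc version in tree):

```rust
use cudarc::driver::{sys, DriverError};

// Treat "already enabled" as success so repeat calls stay idempotent.
fn fold_already_enabled(res: Result<(), DriverError>) -> Result<(), DriverError> {
    match res {
        Err(e) if e.0 == sys::CUresult::CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED => Ok(()),
        other => other,
    }
}
```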
Callers must invoke this before doing cross-card transfers; we do
not auto-enable in `BackendDevice::new` because that would require
globally tracking all already-constructed `CudaDevice`s, which would
be a much larger surgery.
Tested: discovered during a topology bench (encoder GPU0 ↔ decoder
GPU1, NVLink-bridged) where `Tensor::to_device` errored with
`CUDA_ERROR_INVALID_CONTEXT`. With this patch plus a one-time
`encoder_device.enable_peer_access(&decoder_device)?` call before
the first transfer, the cross-card path succeeds and routes through
the NVLink bridge.
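Roughly, the one-time setup from that bench (variable names are illustrative):

```rust
use candle_core::{Device, Result};

fn setup_devices() -> Result<(Device, Device)> {
    let encoder_device = Device::new_cuda(0)?;
    let decoder_device = Device::new_cuda(1)?;
    // One-time call before the first cross-card transfer; enables peer
    // access in both directions between the two contexts.
    encoder_device.enable_peer_access(&decoder_device)?;
    Ok((encoder_device, decoder_device))
}
```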
Builds: `cargo build -p candle-core --features cuda` — clean.
Fixes #3524.
Adds an explicit `Device::enable_peer_access(&other)` so GPU-direct cross-card tensor operations (`Tensor::to_device(&other_cuda_device)`) succeed instead of erroring with `CUDA_ERROR_INVALID_CONTEXT` on first use.
Why explicit, not auto-enable in `BackendDevice::new`
Per #3524 I considered (A) explicit opt-in vs (B) opportunistic auto-enable in `BackendDevice::new`. Going with (A) here: auto-enabling would require globally tracking already-constructed `CudaDevice`s, as noted above. Happy to switch to (B) instead if maintainers prefer that direction; the underlying `enable_peer_access_one_way` helper is the same either way.
Errors
Returns the underlying `DriverError` if the device pair doesn't support peer access (some heterogeneous or IOMMU-isolated configurations) or if either context is in a terminal state. Callers wanting to probe support before attempting can use `cuDeviceCanAccessPeer` separately (a probe sketch follows); that call is not wrapped here.
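A hedged probe sketch; the exact cudarc sys surface varies by version (some versions route raw calls through a loaded library handle), so treat the binding call below as an assumption:

```rust
use cudarc::driver::sys;

/// Returns true if `dev` can directly access `peer`'s memory.
fn can_access_peer(dev: sys::CUdevice, peer: sys::CUdevice) -> Result<bool, sys::CUresult> {
    let mut can = 0i32;
    // Raw CUDA driver call: cuDeviceCanAccessPeer(canAccessPeer, dev, peerDev).
    let res = unsafe { sys::cuDeviceCanAccessPeer(&mut can, dev, peer) };
    match res {
        sys::CUresult::CUDA_SUCCESS => Ok(can != 0),
        err => Err(err),
    }
}
```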
Adjacent
This is the third small fix from our Jina V4 + multi-GPU work. The other two are #3520 (qwen2 RoPE fp32 cos/sin tables) and #3521 (FA v2.8.3 vendored kernel bump). All three were independently discovered while bringing a Jina V4 embedder up under candle on a 2-A6000 rig; happy to provide more context on any of them.